The R Data Analysis System
Graphics
This document focuses on some of the graphical procedures available in R. R's graphics capabilities are extensive, but they cannot be called user friendly. R has two graphics packages: a basic package and a fancier package called lattice graphics. Everything you will want to do right now can be accomplished with the basic package. This section describes only the most often used graphics functions. Much more information about these functions and others is available in the R help files.
Stemplots
Stemplots are also called stem and leaf diagrams. In R they are produced in the console window by the command
> stem(data)
where data is the name of the numeric vector you want to plot. The stem function has optional arguments for controlling how many stems and leaves you see. The default values of these arguments are usually satisfactory. Stemplots are not very useful for large data sets.
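As a minimal sketch of the stem command, the example below uses a small simulated data set. The vector name scores and its values are assumptions made up for illustration; scale is one of the optional arguments of stem.

```r
# Hypothetical data: 30 simulated exam scores (not from the original text).
set.seed(1)
scores <- round(rnorm(30, mean = 70, sd = 10))

# Default stemplot, printed as text in the console.
stem(scores)

# scale = 2 makes the plot roughly twice as long, i.e. more stems.
stem(scores, scale = 2)
```

Comparing the two printouts shows how the scale argument trades compactness for detail.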
Boxplots
A boxplot is also called a box and whisker diagram. It visually depicts the five number summary of a numeric data set, i.e., the minimum, the maximum, and the quartiles. It also shows outliers, which by default are the points lying more than 1.5 times the interquartile range beyond the box. To make a boxplot of a vector called data, type
> boxplot(data)
The boxplot function has optional arguments for
controlling the page layout of the plot and fine details of the boxes
and whiskers in the plot. When you are ready to experiment with
them, you can read about these options by calling
> help(boxplot)
Side-by-side boxplots are useful for comparing the distributions of several data vectors. If data1, data2, data3, etc. are the names of the data vectors, a side-by-side boxplot can be produced by
> boxplot(data1, data2, data3)
Side-by-side boxplots of data grouped by levels of a factor can be produced using formula syntax, as in
> boxplot(data ~ factor)
where data and factor are two variables in the same data frame.
An optional argument that is sometimes helpful is the logical varwidth argument. It adjusts the width of each box to reflect the sample size of its data set (the widths are proportional to the square roots of the sample sizes). It is invoked as follows.
> boxplot(data1, data2, data3, varwidth=TRUE)
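The boxplot variants described above can be tried on simulated data. This is only a sketch: the data frame crops and its columns yield and plot_id are hypothetical names invented for the example, and the group sizes are deliberately unequal so that varwidth has a visible effect.

```r
# Hypothetical data: a numeric response and a grouping factor
# with unequal group sizes (10, 40 and 25 observations).
set.seed(2)
crops <- data.frame(
  yield   = c(rnorm(10, 50, 5), rnorm(40, 55, 5), rnorm(25, 60, 5)),
  plot_id = factor(rep(c("A", "B", "C"), times = c(10, 40, 25)))
)

# Formula syntax: one box per level of the factor, widths reflecting group size.
b <- boxplot(yield ~ plot_id, data = crops, varwidth = TRUE)

# boxplot() also returns the numbers behind the picture.
b$n      # number of observations in each group
b$stats  # per group: lower whisker, lower hinge, median, upper hinge, upper whisker
```

Keeping the returned object is optional; it is handy when you want the plotted summary values as numbers.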
Quantile Plots
Suppose a numeric data vector is a sample of size n from
some distribution. A normal quantile plot of the data compares
the order statistics (ordered data values) to the expected order
statistics from a standard normal distribution. The vertical
coordinates of the points are the ordered data values and the
horizontal coordinates are the expected standard normal order
statistics. If the data are a sample from the normal distribution with mean µ and standard deviation σ, the points of the normal quantile plot will lie close to a straight line with intercept µ and slope σ. To make a normal quantile plot, type
> qqnorm(data)
> qqline(data)
The second command above draws a line through the two points whose coordinates are the first and third quartiles on each axis. This line helps you assess the straightness of the set of points, and thus how far the data depart from normality.
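The claim about intercept µ and slope σ can be checked informally on simulated data. In the sketch below the vector name sample_data and the values µ = 10 and σ = 2 are assumptions chosen for illustration, and the least squares fit through the plotted points is just one rough way to read off the intercept and slope.

```r
# Hypothetical data: a normal sample with mean 10 and standard deviation 2.
set.seed(3)
sample_data <- rnorm(500, mean = 10, sd = 2)

# qqnorm() draws the plot and returns the plotted coordinates.
qq <- qqnorm(sample_data)
qqline(sample_data)

# Rough check: a straight-line fit through the plotted points should have
# intercept near the mean and slope near the standard deviation.
fit <- lm(qq$y ~ qq$x)
coef(fit)
```

Replacing rnorm with a skewed generator such as rexp shows what a clear departure from the line looks like.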
There is another type of quantile plot for
comparing the order statistics from two independent samples to assess
whether or not they come from the same distribution. The two
samples do not have to be of the same size. If the data from the
two samples are the vectors data1 and data2, the command is
> qqplot(data1, data2)
The interpretation of this plot is similar to that of the plot produced by qqnorm. If the two samples are from distributions that differ only in location and scale, the points of the plot should be close to a straight line. Reliable interpretation of quantile plots takes practice.
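A two-sample quantile plot can be sketched with simulated samples of different sizes. The vector names and the distribution parameters below are made up for the example; both samples are normal, so they differ only in location and scale.

```r
# Hypothetical samples of different sizes from the same family.
set.seed(4)
sample_a <- rnorm(60, mean = 0, sd = 1)
sample_b <- rnorm(150, mean = 5, sd = 3)

# qqplot() accepts samples of unequal length and returns the plotted points.
qq <- qqplot(sample_a, sample_b)

# The number of plotted points matches the smaller sample.
length(qq$x)
```

Because the two distributions differ only in location and scale, the points should fall near a straight line, though not the line y = x.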
Histograms
A basic, no-frills histogram of a numeric vector data can be produced with the command
> hist(data)
There are lots of optional arguments to the hist function for controlling the colors and appearance of the bars in the histogram, the titles, and the axis labels. The default bins or class intervals are chosen on the basis of the range of the data and the size of the data set and are always of equal length. The argument breaks allows you to specify your own class intervals. Here breaks is an increasing numeric vector that gives the end-points of the intervals, which must cover the whole range of the data. By default the intervals are closed on the right, and the first interval also includes its left end-point. For example, to make a histogram with bins [0,1], (1,2], ..., (9,10] use
> hist(data, breaks=0:10)
To make one with bins [0,2], (2,6], (6,8], (8,10] use
> hist(data, breaks=c(0,2,6,8,10))
Notice that the bin widths are not all the same in the last example.
By default, when the bins are of equal length, the vertical scale of the bars of an R histogram shows the counts of the data values that fall in each of the class intervals. (With bins of unequal length, as in the last example, R plots densities by default.) Histograms are often used as approximations of a density function, and in that role they should be density functions themselves. That is, the sum of the areas of the histogram bars should be 1. This can be accomplished by setting the logical probability argument, which may be abbreviated to prob, to TRUE, as in
> hist(data, prob=TRUE)
With bins of equal length, the shape of the histogram is the same when prob = TRUE as when prob = FALSE. The only difference is in the scale of the vertical axis.
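The difference between the count scale and the density scale can be verified numerically, since hist returns the values it plots. The sketch below uses a simulated vector named values, an assumption for illustration, with all values between 0 and 10 so that the breaks cover the data.

```r
# Hypothetical data: 200 values between 0 and 10.
set.seed(5)
values <- runif(200, min = 0, max = 10)

# Equal-width bins: the bar heights are counts.
h_counts <- hist(values, breaks = 0:10)
sum(h_counts$counts)                        # total number of observations

# Unequal bins on the density scale: the bar areas sum to 1.
h_dens <- hist(values, breaks = c(0, 2, 6, 8, 10), probability = TRUE)
sum(h_dens$density * diff(h_dens$breaks))   # total area of the bars
```

Multiplying each density by its bin width gives the proportion of the data in that bin, which is why the areas add up to 1.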
Bar Charts and Pie Charts
Bar charts are different from histograms.
Histograms are for numeric data whereas bar charts and pie charts are
for counts of levels of a factor. Usually, the counts must be
tabulated first with the table function.
> counts = table(factor)
This produces a vector of the counts of the various levels of the factor variable factor. This is a vector with named components, the names being the factor levels. The bar chart is then given by
> barplot(counts)
and the pie chart by
> pie(counts)
Optional arguments to barplot allow horizontal bars, multicolored bars, stacked bars, labels and legends.
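The tabulate-then-plot sequence can be sketched as follows. The factor blood and its values are hypothetical, and the colors and title are arbitrary choices made for the example.

```r
# Hypothetical factor: blood types of ten people.
blood <- factor(c("A", "O", "B", "O", "A", "O", "AB", "A", "O", "B"))

# Tabulate the counts of each level; the names are the factor levels.
counts <- table(blood)
counts

# Bar chart with horizontal, multicolored bars and a title.
barplot(counts, horiz = TRUE,
        col = c("tomato", "gold", "steelblue", "seagreen"),
        main = "Blood types")

# Pie chart of the same counts.
pie(counts)
```

The levels appear in alphabetical order (A, AB, B, O) because that is how factor orders them by default.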
Scatterplots and Linear Regression Lines
Scatterplots or scatter diagrams are probably encountered more than any other kind of plot in elementary statistics. A scatterplot is just a plot of a finite set of points (x_i, y_i) in a Cartesian plane. The usual reason for doing a scatterplot is to conjecture or investigate a noisy functional relationship between y and x. If the values of x are in a numeric vector xdata and the values of y in a numeric vector ydata, the command
> plot(xdata, ydata)
gives you the scatterplot. It is important that the lengths of xdata and ydata be the same. If they are not, you will get an error message. If xdata and ydata are the columns of a two-column matrix or data frame xyframe, the plot is even easier.
> plot(xyframe)
Sometimes it is more convenient to use formula syntax to create a scatterplot.
> plot(ydata ~ xdata)
If either xdata or ydata is non-numeric, the plot function gives a different kind of plot, depending on the character of the variables. All of these kinds of plots are useful in different ways.
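The forms of the plot command described above can be compared on one simulated data frame. The data frame study and its columns dose, response and group are hypothetical names invented for this sketch.

```r
# Hypothetical data: a numeric predictor, a numeric response, and a factor.
set.seed(6)
study <- data.frame(
  dose  = runif(40, min = 0, max = 10),
  group = factor(rep(c("control", "treated"), each = 20))
)
study$response <- 1 + 0.5 * study$dose + rnorm(40)

# Two numeric vectors: an ordinary scatterplot.
plot(study$dose, study$response)

# The same scatterplot using formula syntax.
plot(response ~ dose, data = study)

# A factor on the right-hand side: plot() draws side-by-side boxplots instead.
plot(response ~ group, data = study)
```

The last call illustrates how plot changes the kind of picture when one of the variables is a factor rather than numeric.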
In order to superimpose the least squares regression line on a scatterplot, you must first calculate the regression coefficients. There are lots of ways of doing this. The most general and useful is with the lm (for linear model) function. First, create the fitted linear model with a command such as
> xymodel = lm(ydata ~ xdata, data=xyframe)
The argument "data = xyframe" is needed only if xdata and ydata are variables in a data frame xyframe and you need to tell lm where they are. If they are variables in the top level of your workspace you don't need this argument. After creating the fitted model, the regression line is superimposed on the scatterplot as follows.
> plot(ydata ~ xdata)
> abline(coef(xymodel))
The function coef extracts the least squares regression coefficients from the fitted model object. abline is a general-purpose function for adding a straight line to an existing plot.
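The whole sequence, from fitting the model to drawing the line, might look like this on simulated data. The true intercept 3 and slope 2 are assumptions chosen for the example; the names xyframe, xdata, ydata and xymodel follow the text.

```r
# Hypothetical data: y = 3 + 2x plus normal noise.
set.seed(7)
xyframe <- data.frame(xdata = runif(100, min = 0, max = 10))
xyframe$ydata <- 3 + 2 * xyframe$xdata + rnorm(100)

# Fit the linear model; data= tells lm where to find the variables.
xymodel <- lm(ydata ~ xdata, data = xyframe)

# Scatterplot with the least squares line superimposed.
plot(ydata ~ xdata, data = xyframe)
abline(coef(xymodel))

# The estimated intercept and slope.
coef(xymodel)
```

As a shortcut, abline also accepts the fitted model object directly, as in abline(xymodel).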
Multiple Scatterplots
You can get simultaneous scatterplots of all the pairs of variables in a data frame by either of the commands
> pairs(dataframe)
or
> plot(dataframe)
This is a very good way to look at how several
variables interact in pairs. Variables in the data frame that are
non-numeric factors are treated as numeric. Thus, pairs involving
factors may not tell you anything useful.
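One way to see the effect of a factor column, and to avoid it, is sketched below with the iris data set that ships with R. Using iris here is a choice made for illustration, not something taken from the text above.

```r
# iris has four numeric columns and one factor column (Species).
# In the full plot, Species appears as the integer codes 1, 2, 3.
pairs(iris)

# Keep only the numeric columns before plotting.
numeric_only <- iris[sapply(iris, is.numeric)]
pairs(numeric_only)
```

Dropping the factor columns first leaves only the panels that show relationships between numeric variables.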
3-D Scatterplots
Three dimensional scatterplots cannot be produced with the basic graphics package. The best way to do them is through R Commander. You load the R Commander package by
> library(Rcmdr)
It takes a few seconds for the package to be loaded and for the R Commander window to open. Suppose
that in your R workspace you have a data frame named dataframe,
with variables x1, x2 and y, among other
variables. First, make dataframe the active data set in R
Commander by clicking on the button labelled "Data set" and then
selecting dataframe from the menu. After that, click on
the "Graphs" drop down menu and select "3D Graphs - 3D
Scatterplot". In the dialog box that pops up, choose y as
the dependent variable and x1 and x2 as the independent
variables. Select or deselect any options you want or don't want
and then click "OK". The three dimensional scatterplot can be
rotated with your mouse, points can be identified, and the picture can
be saved to an external file.
Smooth Curves
Suppose funct is the name of a previously defined function of one numeric variable. It can be either a built-in R function or a function that you have defined yourself. To plot the graph of the function from the lower limit a to the upper limit b, type
> curve(funct, from=a, to=b)
For example, the sine function is graphed by
> curve(sin, from=-pi, to=pi)
There are optional arguments for adjusting certain features of the plot. If the plotted function has a simple formula, the formula, written in terms of the variable x, can be used in place of the name of a function. This eliminates the need to define the function prior to calling curve. For example,
> curve(-2*x^2+1, from=-2, to=2)
plots part of the parabola y = -2x^2 + 1.
You can superimpose a curve on a pre-existing plot by using the logical "add" argument to the curve function.
> curve(-2*x^2+1, add=TRUE)
Notice that the from and to arguments were not used here because, presumably, you want the added curve to extend across the range of the previous plot.
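A common use of the add argument is to overlay a density curve on a histogram drawn on the density scale. The sketch below assumes a simulated vector named heights and a user-defined function named logistic; both are made up for the example.

```r
# Hypothetical data: 300 simulated heights in centimetres.
set.seed(8)
heights <- rnorm(300, mean = 170, sd = 8)

# Histogram on the density scale, so a density curve is comparable to it.
hist(heights, probability = TRUE)

# Overlay the normal density with the sample mean and standard deviation.
# No from/to: the curve spans the x-range of the existing plot.
cv <- curve(dnorm(x, mean = mean(heights), sd = sd(heights)), add = TRUE)

# A user-defined function can be plotted by name in a new plot.
logistic <- function(t) 1 / (1 + exp(-t))
curve(logistic, from = -6, to = 6)
```

curve evaluates the function on a grid of points (101 by default) and returns them, which is why the result can be stored in cv.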
Copying and Saving Plots
To copy and paste a plot into another
application, such as a Microsoft Word document, right-click in the plot
area and from the popup menu select either "Copy as metafile" or "Copy
as bitmap". Then you may simply paste it into your other
application. To save a plot, right-click again and select "Save
as metafile" or "Save as bitmap". These can be converted to jpeg
or gif files externally if you like. You may find resolution to
be improved by resizing the plot inside R before copying it. The
menu also allows you to print the plot to a printer or to pdf format,
if you have the right software.
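Plots can also be saved from the command line rather than through the menu, which works on any platform. This is an alternative to the menu route described above, not part of it; the file name and the plotted data in this sketch are hypothetical.

```r
# Hypothetical output file in R's temporary directory.
out_file <- file.path(tempdir(), "example-plot.pdf")

# Open a pdf graphics device; plotting commands now draw into the file.
pdf(out_file, width = 6, height = 4)
plot(1:10, (1:10)^2, main = "Example plot")

# Close the device to finish writing the file.
dev.off()

file.exists(out_file)
```

The file is only complete after dev.off() is called, so forgetting that step is a frequent cause of empty or unreadable output files.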